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Abstract 

We propose a genetic association analysis using Dirichlet regression to analyze the Genetic Analysis Workshop 18 
data. Clinical variables, arranged in a longitudinal data structure, are employed to fit a multistate transition model 
in which the transition probabilities are served as a response in the proposed analysis. Furthermore, a gene-based 
association analysis via penalized regression is implemented using the markers at a single-nucleotide 
polymorphism level that we previously identified via nonpenalized Dirichlet regression. 



Background 

Genetic association analyses have had tremendous suc- 
cesses in recent years; however, most of these analyses 
were based on binary or continuous responses. Thus we 
propose a multivariate response vector indicating prob- 
abilities of transitions to predefined hypertensive states. 
This enables us to reflect the inherent uncertainty 
involved in the probability that a patient will transfer to 
a given state. 

An important feature of our approach is the incor- 
poration of prehypertension as an intermediate state. As 
Winegarden argues, prehypertension blood pressure in 
young patients helps predict the development of hyper- 
tension [1]. 

Methods 

Definition of response 

We defined a response summarizing the phenotype 
information into a vector that will be used in a 
genetic association analysis. The response is defined 
as a 3-dimensional vector of probabilities 

Y={Yi.Y2,Y3),Y!' = l{yi,Y2'y3)'J2^' = ^^^h that 
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each element measures the probability of a transition to 
a blood pressure level (normotensive, prehypertensive, 
or hypertensive) given the previous level. 

The analysis was done without the knowledge of the 
underlying simulation model and we used the real phe- 
notype data only. 

Data quality control 

Data quality control was performed in PLINK [2]. We only 
considered the data from chromosome 3 for analysis. We 
used a call rate for individuals of 95%, a Hardy- Weinberg 
disequilibrium test at a significance level of 1 x 10" , and a 
missing rate of 95% for each marker. Markers with a 
minor allele frequency of at least 5% were retained for ana- 
lysis. Additionally, all individuals' time points with at least 
1 missing clinical variable were excluded. 

Multistate transition model 

We describe hypertension, our trait of interest, using a 
3-state model based on recorded blood pressure levels 
for each individual at each examination. The states are 
defined as follows: normal blood pressure (state 1) when 
the systolic blood pressure is less than 120 mm Hg and 
diastolic blood pressure is less than 80 mm Hg; prehyper- 
tension (state 2) when the blood pressure level is not in 
state 1, the systolic blood pressure is less than 140 mm 
Hg, and the diastolic blood pressure is less than 90 mm 
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Hg; and hypertension (state 3) for all other cases. Also, if 
a patient used antihypertensive medication, the state 
assigned at that examination is hypertension (state 3) 
regardless of the recorded blood pressure levels. Once 
the states are defined, we consider a multistate transition 
model; it is important to note that all 9 transitions are 
possible. 

Our interest in transition models lies in estimating of 
the transition probabilities as defined in Kalbfleisch and 
Lawless [3] which are given by 

P(S,(t) =} I Si(t-l)=l,Xi{t-l))=yaj{t),l,i e {1,2,3} 

where {S, (r),r = 1,2,...) and {xi(r) = {xn (r) .r,p(r)),r = 1,2,...) 

denote the observed state and the covariates for subject 
/ at the J-* examination respectively. 

This model takes advantage of the longitudinal data 
structure and the definition of the response follows 
naturally. To estimate the transition probabilities, we 
fit a multinomial regression model, based on covariates 
(gender, smoking status and age) and the state at the 
previous examination. 

To get expressions for yu = {yiii,yii2, yiis) J = 1, 2, 3, we 
consider a generalized logit model of the form 

log {Yilj/Yill) = ZiiYipi = 1, 2, 3, j ^ I 

besides, 1 = ym + J'"') = J''" + ^thm '^P^^''"^'^)' 

where Zu = [Xi (I) , time) is the observed vector of covari- 
ates for subject i plus a categorical variable denoting the 
effect of examination time in the model (and possible 
interactions), and Yij is the vector of coefficients for the 
corresponding multinomial regression model. 

Thus, a transition probability matrix (TPM) is defined 
for each individual as follows 



1 




exp(ziiyi3) 


exp(zi2y2i) 


1 




1 + T.j.i.ij'^ expiziiYij) 




1 






1 + Ejii expizay,^) 



Therefore, the response for subject / is a row taken 
from TPM, and is determined by conditioning on the 
patient's last available observed state and covariates. 

Dirichlet regression 

Once the response is modeled our objective is to 
determine whether there is an association between it 
and the genotypes. We assess this association using 
Dirichlet regression [4], which suits this response 
structure. The advantage of this approach lies in its 
tractability in dealing with the proposed response. It 
also allows a more comprehensive understanding of 
the genetic effect on the expression of hypertension, 



and therefore in its possible interpretation. For 
instance, if a signal was detected for a marker, it 
would suggest an association between the marker and 
the transition of blood pressure states jointly rather 
than a single level. Therefore, the Dirichlet approach is 
more informative in the sense of explaining the plausi- 
bility of each defined state. 

To relate the genetic information and the defined 
response under a Dirichlet regression approach, the like- 
lihood given each individual's vector of covariates, s,, is 



n:., 



r(AW)n 



A.j(sO-l 

Yij 



J^f r (XjitextbfSi)) 



where Xj (s,) = Ay > 0, A (s,) = Aj = ^ ^ Xj (sj) and 

r(-) is the gamma function. 

The parameters, Xj (s;), are defined in terms of a linear 
predictor using a logarithm link, 

log {Xj (s,)) = log (Xij) = Pm^i'" = S'Ai = 1' 2, 3 

where M is the number of covariates included in the 
model and Pj is the vector of regression coefficients that 
explains the effects (in log scale) of the covariates on 
the 7* component. 
Considering the above, 2 models are analyzed: 
Model 1 (Ml): log (Ay) = a^^ + fi^^g'^ibase model) 
Model 2 (M2): log (Ay) = a^^ + pf^^ + FAMi^j^^ 
(adjusted model) 

Here ^ represents the number of copies of the minor 
allele on the single-nucleotide polymorphism (SNP) for 
the individual under an additive genetic model; FAM; 
is the row of contrast matrix for the pedigree number 

considered as a categorical variable and 9^ = (aj*, /jj', d^^^f 
is the vector of regression coefficients on the /* compo- 
nent. (Note 1 = 0). 

Our interest in these models lies in the potential 
genetic effect of each marker on the proposed response. 
To assess this, Wald statistics were used to test the null 
hypothesis of no association between each SNP and the 
response, Hq : ^ = 0 (vs.Ha : notHo), jS = {Pi, Pi, Ps). 

Gene-based association 

Once we identify significant SNPs through the genetic 
association analysis as described above, we proceed to 
perform the analysis at a gene level. To achieve this, we 
propose a penalized regression. Including all the mar- 
kers simultaneously, this penalized method aims to 
select those SNPs with higher association. The analysis 
is done on those candidate genes that contain at least 1 
significant marker that has already been determined. 
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Variable selection on the SNPs is assessed via a pena- 
lized likelihood of the form 

pKi); Y,G,c,K) = l in; Y, G) - ck J^l^ -^Mi ~ {I - c)k J^^^ ||>).,|Ii 

where I (??; Y, G) represents the log-likelihood of a 
dirichlet distributed sample with response matrix 
G = {g\, ...,gl,YG= {g\, . . .,gl,Y (or G : FAM for M2) 
is the design matrix, gi = (1,^/, . . . ,^); P denotes the 
number of markers considered for variable selection; k 
is the number of states; f] is the regression coefficients 
vector; c and X. are parameters for the penalized regres- 

sion; and WrjiWj = ( ^fjj ^'^^ penalty 

norms. It is important to note that when c = 1 we have 

a ridge regression penalty, whereas when c = 0 we have 
a lasso penalty. We implement the variable selection for 
the penalized dirichlet regression using R code provided 
on the Statistical Genetics and Genomics Laboratory at 
the University of Pennsylvania webpage [5]. 

Results 

Data quality results 

The Genetic Analysis Workshop 18 (GAW18) data con- 
sists of 855 individuals with genotype and phenotype 
information. As a result of missing data, transition prob- 
abilities are estimated for only 835 individuals. Of these, 
43 are removed because of low call rate. The overall call 
rate for the remaining 792 individuals is 99.82%. 

The Genome Wide Association Study (GWAS) data 
includes 65,519 SNPs for chromosome 3, of which 59 
are excluded because it is not possible to reliably obtain 
position information for these markers. The remaining 
65,460 SNPs are considered for data quality control. 
Because of a low genotyping rate, 114 markers are 
removed; none are excluded by the Hardy- Weinberg 
equilibrium test; and 13,011 markers are removed 
because of low minor allele frequency. The remaining 
52,357 markers are considered for analysis. 

Analysis results 

The parameter estimates for the transition models are 
obtained using R [6]. To test examination time effect; 
likelihood ratio tests are performed in which the null 
model considers only the available clinical variables. 
Table 1 presents the final transition models. 



Table 1 Selected transition models 



Transition 




Model 






log(yii/yii) = 


rijO + (rijlXset + YljlXsmoh! + YlpXage) * time] = 


2,3 




log {yij/Yii) = 


^ YljO + YljlXsex + Y2i2Xsnwlw + Y^jiXage + time j = 


1,3 




log(y3j/y33) 


= YilO + YsjlXsex + Y3j2Xmmke + YipXagej = 


1,2 



After the response is estimated, models Ml and M2 
are fit using R [7] for each available SNP. Figure 1 dis- 
plays the Manhattan plots for the p values that result 
from testing the null hypotheses of no association 
between the markers and the response. The graphs 
show that only 1 marker under M2 is significant at the 
standard significance level for GWAS (5 x 10"^). Inter- 
estingly, the same marker is the most significant marker 
under Ml, although it is not significant at the standard 
threshold. This suggests that the adjustment for family 
incorporated in M2 accounts for the family structure in 
the data. Also, the proposed methodology demonstrates 
consistency in that the same marker proves to be the 
most significant under both models. Table 2 summarizes 
these findings. 

Once significant markers were identified, a gene-level 
association analysis is performed using the penalized 
regression described above for different levels of c (0, 0.3, 
0.5, 0.7, and 1). The analysis is conducted utilizing both 
the GWAS and the dosage imputed genotypes (GENO) 
information as the explanatory variables. Figure 2 shows 
the penalized regression results for the gene containing 
the significant SNP (rsl2492830) for c = 0.5 only. This 
level of c is a blended penalty function, equally weighting 
the ridge and lasso penalties. Table 3 shows the results 
for different levels of c under M2 for gene PCCB. 

Discussion 

The present work implements a multistate transition 
model that conveniently accommodates the longitudinal 
data structure. Whether the information contained by 
the available clinical variables is sufficient for predicting 
the hypertensive state is debatable, however. 

Although the adjusted model (M2) is an improvement 
over the base model (Ml), neither of the described 
models accounts for correlation between individuals nor 
heteroscedasticity. One way to possibly overcome this is 
to incorporate a latent variable into the model. Such an 
extension follows. 

Model 3 (M3):log(>.i,) = af^ + P^^^ + m; where w, is 
the element of a vector u that follows a MVN(0, K) 
distribution; here K is twice the estimated kinship 
matrix. In this case, however, the estimation of the para- 
meters of interest, Pj, is not straightforward. Further 
research of this methodology is warranted. 

With respect to the penalized regression, to avoid an 
arbitrary selection of c,a cross-validation method could 
be implemented. 

Conclusions 

We propose a methodology that conveniently uses the 
longitudinal data structure to define a probabilistic out- 
come, which, we believe, explains hypertension in a more 
suitable way. Dirichlet regression provides an interesting 
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Table 2 Association analysis results 



SNP Gene 


MA 


MAF (%) 


p Value (M1) 


p Value (M2) 




rs 12492830 PCCB 


C 


722 


1.1 X 10"' 


3.9 X 10"*^ 




MA, minor allele MAf, minor allele frequency. 



Gene: PCCB 



o 



GWAS 
GENO 



J I 



n 1 1 1 1 1 

laSMOKb 135960 Kb 135900 Kb 136000 Kb 136020 Kb 13604010) 



Position 



Figure 2 Penalized regression results for M2 only 



Table 3 Comparison of penalized regression under different levels of c 



No. of parameters selected (iterations for convergence) 


Data 


# SNPs 


C = 


0 


0.3 0.5 0.7 


1 


GWAS 


22 




1 0 (260) 


10 (148) 8(135) 7 (163) 


7 (174) 


GENO 


607 




42 (427) 


35 (199) 24 (176) 25(136) 


19 (115) 
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approach that, along with other more common responses, 
can be successfully used in genetic association analysis. 
Our model finds a statistically significant marker at the 
standard significant level for GWAS, which is noteworthy, 
considering that it is often difficult to find association. 
Moreover, when the penalized method is used on the 
GENO data we are able to find significant markers in 
addition to those have already found using GWAS data. 
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